Load the libraries
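The library chunk itself was not echoed in the knit output; a plausible reconstruction (the exact package set is an assumption) is:

```r
library(plyr)          # loaded before dplyr (dplyr masks several plyr verbs)
library(dplyr)
library(magrittr)
library(reshape2)
library(tidyr)
library(ggplot2)
library(tm)            # assumption: source of the NLP dependency
library(caret)
library(mlr)           # mlr is in maintenance mode; mlr3 is its successor
library(doParallel)    # pulls in foreach, iterators, parallel
library(missForest)    # pulls in randomForest, itertools
library(GGally)
library(MASS)
library(polycor)       # pulls in msm
library(rpart)
```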


Abstract

Introduction

According to the anti-malware company Malwarebytes, “Spam is any kind of unwanted, unsolicited digital communication, often an email, that gets sent out in bulk. Spam is a huge waste of time and resources.” At its most benign, spam is the digital equivalent of junk mail, costing recipients nothing more than time and aggravation. However, spam emails can expose individuals and companies to a litany of attacks, including phishing, ransomware, and, most recently, cryptojacking. These cyber attacks can result in significant costs to companies in the form of lost time, revenue, resources, and intellectual property. Without a good spam filter, even the most sophisticated layered cyber defense system is vulnerable to an attack that originates from a malicious email carelessly opened by a trusting employee. For this reason, spam filters are just as important to cyber security as firewalls.

The simplest filters are list-based: as the name implies, they compare the words in an email against a combination of blacklists, greylists, and whitelists to determine whether the email should be blocked, flagged, or allowed. These filters require continuous list updates and produce high rates of both false positives and false negatives. A more effective alternative is the heuristic filter, which uses statistical methods and machine learning algorithms to estimate the probability that an individual email is spam. In this study we will create a decision-tree-based spam filter and compare it to a naive Bayes filter, evaluating the accuracy and precision of each.

Background

Spam is generally defined as unsolicited email that has been sent en masse to many users. Spam emails can also include attachments that spread viruses. Clearly, spam has become a major problem for users and businesses, which led to the advent of spam filters in email systems. In the past, these filters relied on keywords within the message to identify spam. Approaches include list-based filters that classify the sender, content-based filters, and collaborative filters in which users report spam messages that are then stored in a database.

Typically, researchers use a taxonomy based on these web filters to prevent spam from spreading, and many different classification methods are used to detect spam. Methods such as random forests, support vector machines, naive Bayes, and decision trees have been used to parse these taxonomy filters and identify which features play the greatest role in defining spam. This project will leverage decision trees so that we can build a taxonomy flow chart and better understand how the different email features relate to one another in identifying spam.

A decision tree is a modeling tool designed to visualize and structure models as a set of rules. We can use it to classify unlabeled data by detecting patterns in the labeled features. For our email spam detection project, we will primarily use rpart (the recursive partitioning package in R) as our modeling method. It outputs a decision tree whose internal nodes test feature values and whose leaves carry the predicted class. Recursive partitioning is useful because it supports both classification and regression and handles categorical and continuous variables with equal ease.

Rpart is one of the most commonly used machine learning packages in R; you will see that it is fairly easy to implement and can be readily tuned to build an optimized model with improved classification accuracy. Rpart implements the non-parametric CART algorithm, which by default uses the Gini index as the splitting criterion.
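As a minimal sketch of rpart's interface (using the emailDFrp data frame introduced in the Data section below):

```r
library(rpart)

# Fit a classification tree with the CART defaults:
# Gini splitting, minsplit = 20, cp = 0.01, maxdepth = 30.
fit <- rpart(isSpam ~ ., data = emailDFrp, method = "class")

# Predicted class labels for the training data
pred <- predict(fit, emailDFrp, type = "class")
```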

Data

setwd("/Users/danielclark/Desktop/SMU/Quantifying_the_World/Unit 5/Week_5_Materials")
#setwd("C:/Users/Akuma2099/MachineLearning/QTW_Project_3")
load("data.Rda") 
load("resampling_binary")
load("spam.tuned.clf")

ls()
## [1] "emailDFrp"       "spam.resamp.bin" "spam.tuned.clf"

The data consists of 9,348 observations. Each observation represents one email message described by 30 variables. The first variable, “isSpam”, is a two-level categorical variable that denotes whether the message was known to be spam; it will serve as the response variable for the study. The remaining variables consist of 16 categorical and 13 numeric variables that represent characteristics of the email, such as “isRe”, which denotes whether the subject line contains the “RE” prefix; “hour”, the hour of the day at which the email was received; “bodyCharCt”, the number of characters in the body of the message; and “perCaps”, the percentage of characters that were capitalized.

## 'data.frame':    9348 obs. of  30 variables:
##  $ isSpam       : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ isRe         : Factor w/ 2 levels "F","T": 2 1 1 1 2 2 1 2 1 2 ...
##  $ underscore   : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ priority     : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ isInReplyTo  : Factor w/ 2 levels "F","T": 2 1 1 1 1 2 1 1 1 2 ...
##  $ sortedRec    : Factor w/ 2 levels "F","T": 2 2 2 2 2 2 2 2 2 2 ...
##  $ subPunc      : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ multipartText: Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hasImages    : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ isPGPsigned  : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subSpamWords : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ noHost       : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ numEnd       : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ isYelling    : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ isOrigMsg    : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ isDear       : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 1 1 1 ...
##  $ isWrote      : Factor w/ 2 levels "F","T": 1 1 1 1 1 1 1 2 1 1 ...
##  $ numLines     : int  50 26 38 32 31 25 38 39 126 50 ...
##  $ bodyCharCt   : int  1554 873 1713 1095 1021 718 1288 1182 5989 1554 ...
##  $ subExcCt     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ subQuesCt    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ numAtt       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ numRec       : int  2 1 1 0 1 1 1 1 1 2 ...
##  $ perCaps      : num  4.45 7.49 7.44 5.09 6.12 ...
##  $ hour         : num  11 11 12 13 13 13 13 14 14 11 ...
##  $ perHTML      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ subBlanks    : num  12.5 8 8 18.9 15.2 ...
##  $ forwards     : num  0 0 0 3.12 6.45 ...
##  $ avgWordLen   : num  4.38 4.56 4.82 4.71 4.23 ...
##  $ numDlr       : int  3 0 0 0 0 0 0 0 0 3 ...

For better interpretation we renamed and releveled the response variable from “T” and “F” to “Spam” and “Valid”.

emailDFrp$isSpam <- emailDFrp$isSpam %>% 
                      revalue(c("T"="Spam", "F"="Valid")) %>% 
                        relevel("Spam")

A check of NA values by column indicated that there were up to 357 observations with missing values, spread across the following seven variables:

This indicates that up to 3.8% of our observations could have missing values.

sapply(emailDFrp, function(x) sum(is.na(x)))
##        isSpam          isRe    underscore      priority   isInReplyTo 
##             0             0             0             0             0 
##     sortedRec       subPunc multipartText     hasImages   isPGPsigned 
##             0             0             0             0             0 
##  subSpamWords        noHost        numEnd     isYelling     isOrigMsg 
##             7             1             0             7             0 
##        isDear       isWrote      numLines    bodyCharCt      subExcCt 
##             0             0             0             0            20 
##     subQuesCt        numAtt        numRec       perCaps          hour 
##            20             0           282             0             0 
##       perHTML     subBlanks      forwards    avgWordLen        numDlr 
##             0            20             0             0             0

Though dropping the missing values was an option, for this study we decided to impute them using random-forest classification and regression with a parallelized method.

registerDoParallel(cores=4)

df <- missForest(emailDFrp, 
                 maxiter=5, 
                 ntree=200, 
                 parallelize = c('forests'),
                 variablewise = TRUE)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
# establish imputed set
emailDFrp <- df$ximp

The imputed values were merged into the original dataset to create a complete dataset with no missing values.

sum(is.na(emailDFrp))
## [1] 0

Methods

Prior to building our rpart model to classify spam and valid emails, we will explore the dataset to detect trends we can potentially leverage in modeling. We will examine the correlation and independence between our predictor variables, as well as the relationship between each predictor and the isSpam response. We also need to account for the mix of continuous and boolean factor variables among the 29 predictors and ensure each is used in the model to maximize performance. Finally, we will leverage rpart's tuning parameters to identify the optimal features for separating spam from valid emails, as well as the optimal variable branches in our decision tree.

Exploratory Data Analysis

After ensuring that no missing values existed in the data set we began our exploratory data analysis by examining the distribution of the response variable.

The chart above shows that we have an unbalanced dataset, with nearly 2,400 spam emails compared to nearly 7,000 valid (non-spam) emails. Roughly 1 out of every 4 emails in our set is spam. This highlights the possible need to account for the imbalance in our modeling, and to ensure that we tune and train our models so they are not rewarded simply for assigning the valid class to new observations. That said, since the imbalance is not severe, we will not need to apply oversampling methods.
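The counts behind the chart can be checked directly (a sketch, assuming the releveled emailDFrp from above):

```r
# Absolute and relative class frequencies of the response
table(emailDFrp$isSpam)
prop.table(table(emailDFrp$isSpam))
```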

A quick review of the numeric variables indicates that there is a great degree of variation both within and between the individual variables. This indicates that normalization or standardization might be necessary.

##     numLines         bodyCharCt        subExcCt         subQuesCt      
##  Min.   :   2.00   Min.   :     6   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.:  19.00   1st Qu.:   587   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median :  32.00   Median :  1088   Median : 0.0000   Median : 0.0000  
##  Mean   :  66.91   Mean   :  2844   Mean   : 0.1315   Mean   : 0.1378  
##  3rd Qu.:  59.00   3rd Qu.:  2192   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :6319.00   Max.   :188505   Max.   :42.0000   Max.   :12.0000  
##      numAtt             numRec           perCaps             hour      
##  Min.   : 0.00000   Min.   :  0.000   Min.   :  0.000   Min.   : 0.00  
##  1st Qu.: 0.00000   1st Qu.:  1.000   1st Qu.:  4.255   1st Qu.: 8.00  
##  Median : 0.00000   Median :  1.000   Median :  6.055   Median :13.00  
##  Mean   : 0.06579   Mean   :  1.918   Mean   :  8.850   Mean   :12.21  
##  3rd Qu.: 0.00000   3rd Qu.:  1.394   3rd Qu.:  9.059   3rd Qu.:18.00  
##  Max.   :18.00000   Max.   :311.000   Max.   :100.000   Max.   :23.00  
##     perHTML          subBlanks        forwards       avgWordLen    
##  Min.   :  0.000   Min.   : 0.00   Min.   : 0.00   Min.   : 1.363  
##  1st Qu.:  0.000   1st Qu.:10.53   1st Qu.: 0.00   1st Qu.: 4.208  
##  Median :  0.000   Median :13.24   Median : 0.00   Median : 4.455  
##  Mean   :  6.517   Mean   :13.87   Mean   :10.45   Mean   : 4.487  
##  3rd Qu.:  0.000   3rd Qu.:15.69   3rd Qu.:15.38   3rd Qu.: 4.729  
##  Max.   :100.000   Max.   :86.42   Max.   :99.06   Max.   :26.000  
##      numDlr        
##  Min.   :   0.000  
##  1st Qu.:   0.000  
##  Median :   0.000  
##  Mean   :   1.782  
##  3rd Qu.:   0.000  
##  Max.   :1977.000

Explanatory Variable Relationships

For our imputed dataset, the correlation matrix shows slight positive and negative correlations among the numeric predictors, the only strong multicollinear relationship (.92) being between “bodyCharCt” and “numLines”.
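A sketch of how this matrix can be computed; indexing with sapply() sidesteps the select() masking between dplyr and MASS noted when the libraries were attached:

```r
# Pairwise Pearson correlations among the numeric predictors
num_vars <- emailDFrp[, sapply(emailDFrp, is.numeric)]
cor_mat  <- cor(num_vars)

# The strongest off-diagonal relationship
cor_mat["bodyCharCt", "numLines"]
```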

Looking at the data, three additional positive relationships stand out:

  • perHTML and numLines (.30)
  • perHTML and bodyCharCt (.38)
  • numDlr and subQuesCt (.37)

While not very high, these correlations mean the predictors could be overweighted in our modeling because they may not be independent. For example, numLines and bodyCharCt are both functions of the length of the email, so they are very similar metrics. However, recursive partitioning accounts for the collinearity between such pairs by selecting the more important variable when similar variables are available.

Looking at the non-numeric values in our dataset, we have 16 boolean variables that we can also factor into the model. To assess them, we apply Fisher's exact test to the dichotomous variables and use the resulting p-values as a numerical comparison, similar to how we used correlation for the numeric variables.
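One way to sketch this in base R is to test each dichotomous predictor against the response (testing predictor pairs works the same way):

```r
# Fisher's exact test p-value for each factor predictor vs. isSpam
factor_vars <- setdiff(names(emailDFrp)[sapply(emailDFrp, is.factor)], "isSpam")
p_vals <- sapply(factor_vars, function(v)
  fisher.test(table(emailDFrp[[v]], emailDFrp$isSpam))$p.value)
sort(p_vals)
```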

Low p-values indicate that we reject the null hypothesis of a random association between dichotomous variables. Here we can see some large non-random dependencies for variables such as “isWrote”, which indicates whether an email is electronically scribed; since this dependence is apparent in almost all instances, we can likely remove it. However, variables such as “priority” and “noHost” may be interesting for classifying spam versus not spam, which makes sense because the lack of a host name and sender are controlled by the email sender.

We can also visually review the correlation between factors and continuous variables using a biserial correlation (all of our non-continuous variables are dichotomous factors). Upon review, some relationships emerge.
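With 0/1 coding, the point-biserial correlation reduces to the ordinary Pearson correlation, so a quick approximation is (a sketch; the variable pairs follow the discussion of the plot):

```r
# Point-biserial correlation: Pearson correlation with the factor
# recoded to 0/1
pb_cor <- function(f, x) cor(as.numeric(f) - 1, x)

pb_cor(emailDFrp$multipartText, emailDFrp$numAtt)
pb_cor(emailDFrp$isInReplyTo,   emailDFrp$forwards)
```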

Reviewing the plot further, the number of attachments (numAtt) has a negative correlation with the boolean “multipartText”: multipart text messages typically do not have attachments. In addition, the number of forwards has a negative correlation with “isInReplyTo”, suggesting that replies typically do not contain many forwarded messages.

Overall, we will look into ways that both continuous and categorical variables can be used to predict the response variable “isSpam”, while also extracting variable importances from the rpart package.

Response Variable Relationships

As we mentioned before, the majority of instances in our dataset are not considered spam. That said, we can visualize the relationships between spam and valid emails for the categorical and continuous predictor variables using different plotting techniques.

First, we will look further into “isRe”, “numEnd”, “subSpamWords”, and “isWrote”. The variable “numEnd” indicates whether the ‘from’ email prefix ends with a number.

In the chart above, subSpamWords is a boolean indicating that a known “spam word”, such as “viagra”, appears in the subject of an email. The majority of our spam cases occur when these factors are false, while emails are mostly valid when “isRe” and “isWrote” are true. This will prove useful when we see how a decision tree splits on the categorical variables to decide whether an email is spam.

We will also look at the separation of classes for the numeric variables by plotting their log values in box plots. Some of the more interesting numeric variables are shown in the figure below.

The “forwards” predictor shows a strong concentration of values in the third quartile for messages labeled valid. The “perCaps” predictor shows a wider spread for spam than for valid emails, and its median value for spam is higher. For “perHTML”, the mean and median for valid emails are nearly zero, while the range is much greater for spam messages.

Examination of these predictors indicates that we have some leads on important features for detecting spam.

Modeling

Base Model Naive Bayes

Variable Selection and Model Comparison Setup

To avoid the complexity of variable selection algorithms, we will first fit an rpart model on the training data using all 29 features and the default model parameters: a minsplit of 20, a complexity parameter of 0.01, and a max depth of 30, with the default Gini index as the splitting criterion.

For our setup, we will use an 80/20 split between the training and testing sets. Since we have an imbalanced relationship between spam and valid emails, we use stratified sampling of the observations to ensure the training and testing sets maintain the original balance. With 9,348 total observations, the testing set will have roughly 1,870 observations.
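A stratified 80/20 split can be sketched with caret (the seed value is an arbitrary assumption):

```r
set.seed(42)

# createDataPartition samples within each level of isSpam, preserving
# the spam/valid ratio in both subsets
train_idx <- caret::createDataPartition(emailDFrp$isSpam, p = 0.8, list = FALSE)
train <- emailDFrp[train_idx, ]
test  <- emailDFrp[-train_idx, ]
```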

As we will run a series of models in our experiment, we will maintain this distribution of training and testing data for each model to ensure that we are running a valid experiment. As our rpart is trained, it will provide a listing of variables that are playing the greatest role in deciding upon valid or spam email. The chart below provides the most important features using the rpart base model.

Figure 7 above indicates that “perCaps”, the percentage of capital alphabetic characters in the body of the email, is the most important feature for identifying spam in the rpart base model. For this model, importance is weighted by the sum of the impurity reduction across each variable's splits. “numLines” and “bodyCharCt” are the second and third most important variables.

We will also look into the rpart control parameters and their ability to classify spam emails based on fitting a separate, optimized rpart model. To do this, we will look into 4 different parameters for tuning. Below, we outline the description of these.

  • Complexity parameter (cp) - A scaled complexity penalty ranging from 0 to 1. Any split that does not decrease the overall lack of fit by at least a factor of cp is not attempted.
  • Minsplit - the minimum number of observations that need to exist in a node in order for the split to be attempted.
  • Maxdepth - The maximum depth of any node of the final tree, with the root node counted at depth 0.
  • Splitting criterion - either “gini” or “information”. The former chooses split points using the Gini index; the latter uses entropy and information gain.

We will tune these parameters so that the resulting decision trees do not overfit.
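Written out explicitly, the default configuration looks like this (a sketch; train is the stratified training set):

```r
library(rpart)

# The rpart defaults, stated explicitly for later comparison
ctrl <- rpart.control(minsplit = 20, cp = 0.01, maxdepth = 30)

base_fit <- rpart(isSpam ~ ., data = train, method = "class",
                  parms = list(split = "gini"), control = ctrl)
```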

Hyperparameter Optimization

We will explore a discrete list of values for the four parameters of interest to help ensure the models run as quickly and efficiently as possible. A grid search procedure will be used in conjunction with ten-fold cross-validation.

We will use the four cells of the confusion matrix to evaluate classification performance, maximizing true positive classification with “spam” as the positive class. We will measure this with the ROC curve and choose the model with the greatest area under the curve (AUC). A false positive means a valid email is marked as spam; a false negative means a spam message ends up in the inbox. Both are model errors.

Base Model Results

Our base model listed “perCaps” as our most important feature, followed by “bodyCharCt”. Given this, we will first look at an rpart model that uses only these two features.

The plot above provides a log-scale visualization of the classification regions. The observations in white are misclassifications, while the colored regions represent the predicted outcomes. The lighter shades of blue and pink represent lower predicted probabilities across the perCaps and bodyCharCt ranges. With only two features, the model produces many false positives, as indicated by the number of circles in the blue section. We will need more variables to increase performance; however, these two are a good starting point.

From here, we will use rpart to build a decision tree with 14 splits as represented below. We will start with the default decision tree as discussed in the previous section, and fit it with the training data set.

We can see from the decision tree above that “bodyCharCt” is used multiple times in the splitting process for the training data. There is also a fair amount of misclassification, particularly lower in the tree, where spam emails are incorrectly classified as valid.

We will now use the test data to produce a confusion matrix and determine model performance, shown in the table below. As mentioned previously, we will review the false positive rate (FPR), false negative rate (FNR), mean misclassification error (MMCE), and ROC AUC.
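The table below can be produced with mlr along these lines (a sketch; the object names and the use of mlr's rpart wrapper are assumptions):

```r
library(mlr)

# Wrap the data and model for mlr, with "Spam" as the positive class
task <- makeClassifTask(data = train, target = "isSpam", positive = "Spam")
lrn  <- makeLearner("classif.rpart", predict.type = "prob")
mod  <- train(lrn, task)   # mlr's train(), which masks caret::train

# Evaluate on the held-out test set
pred <- predict(mod, newdata = test)
calculateConfusionMatrix(pred)
performance(pred, measures = list(auc, mmce, fpr, fnr))
```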

##         predicted
## true     Spam Valid -err.-
##   Spam    388    91     91
##   Valid    75  1315     75
##   -err.-   75    91    166
##        auc       mmce        fpr        fnr 
## 0.93449257 0.08881755 0.05395683 0.18997912

To capture a general error rate, we will use MMCE and AUC as our key performance indicators, as they indicate how well we are classifying spam and valid emails overall. As diagnostic measures, we will watch FPR and FNR to keep our models honest. Looking at the confusion matrix above, while we do well on AUC, we struggle with false negatives: we misclassify 91 of the 479 spam observations in our test set, a false negative rate of nearly 19%. The MMCE of about 9% is also not particularly good. Our aim moving forward will be to improve these numbers using rpart's hyperparameters.

Hyperparameter Tuning

To improve our MMCE and FNR, we will adjust the complexity parameter, minimum split, maximum depth, and splitting criterion. To avoid running the model repeatedly by hand and to lighten the code, we will use a grid search to test rpart on combinations of these parameters. The search space is listed below.

Parameter search criteria:

  • complexity parameter (cp) - 0.001, 0.01, 0.1, 0.2, 0.5
  • minsplit - 1, 5, 10, 15, 20, 30
  • maxdepth - 1, 5, 10, 15, 20, 30
  • splitting criterion - gini, information

For each parameter combination, we performed ten-fold cross-validation on the training data, using AUC as the key performance indicator.
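With mlr, the grid search over this space can be sketched as follows (object names are assumptions; encoding the splitting criterion through rpart's parms argument is one possible approach):

```r
library(mlr)

task <- makeClassifTask(data = train, target = "isSpam", positive = "Spam")
lrn  <- makeLearner("classif.rpart", predict.type = "prob")

# Discrete search space matching the criteria listed above
ps <- makeParamSet(
  makeDiscreteParam("cp",       values = c(0.001, 0.01, 0.1, 0.2, 0.5)),
  makeDiscreteParam("minsplit", values = c(1, 5, 10, 15, 20, 30)),
  makeDiscreteParam("maxdepth", values = c(1, 5, 10, 15, 20, 30)),
  makeDiscreteParam("parms",    values = list(gini = list(split = "gini"),
                                              info = list(split = "information")))
)

# Exhaustive grid search with stratified ten-fold cross-validation,
# scored by AUC
rdesc <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
tuned <- tuneParams(lrn, task, resampling = rdesc, measures = auc,
                    par.set = ps, control = makeTuneControlGrid())
tuned$x  # best parameter combination
```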

The optimization procedure tops out at roughly 0.97 AUC on the training data, and the optimized tree is much larger than our base model: the plot above shows 50 to 100 total splits. This is because the optimized complexity penalty (0.001) is lower than the default (0.01), which may cause overfitting to the test set.


The best models generated AUC scores greater than 0.96 with a complexity parameter of 0.001. Additionally, the maxdepth of the top-performing trees is always 10 or higher. About half of the best models use the information splitting criterion, and the majority use a minsplit of 10 or more. Overall, the optimization leans toward greater splitting and a medium node depth.

We can see the complexity parameter's dominance in the figure below: higher AUC is associated with a lower cp.


After exploring the rpart tuning parameters during the grid search, we significantly increased the AUC score of our optimized model, from 0.934 to 0.966, and decreased our MMCE from 0.089 to 0.054. In our base model we ran into a high false negative rate (classifying spam emails as valid), which was reduced most significantly of all, from 0.190 to 0.129.

         optimized   base
auc          0.966  0.934
mmce         0.054  0.089
fpr          0.027  0.054
fnr          0.129  0.190

Ensemble Hyper Parameter Tuning

In a parallel notebook (https://github.com/dhyanshah/MS7333_QTW/blob/master/Case5/Case5-Spam.ipynb) we explored a series of models and estimator tuning procedures to test which combination would generate the highest AUC score.

       Ensemble_Optimized   optimized    base
auc                 0.720       0.966   0.934
mmce                0.399       0.054   0.089
fpr                 0.523       0.027   0.054
fnr                 0.038       0.129   0.190

While our grid searched optimized model exceeded our base model on each metric, we will want to visualize the optimizations using an ROC curve, which you can see below.

Above, we can see visually that our optimized rpart model outperforms the base rpart (with default parameters), with more area under the curve (blue line versus red line). Since the ROC curve plots the true positive rate against the false positive rate, we can see that the optimized model most significantly exceeds the base model through its ability to reduce the false negatives present in the base model.
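In mlr, such curves can be drawn from the two prediction objects (a sketch; pred_base and pred_opt are assumed to be the predictions from the base and optimized models):

```r
library(mlr)

# Threshold-vs-performance data for both models, then the ROC overlay
roc_df <- generateThreshVsPerfData(list(base = pred_base, optimized = pred_opt),
                                   measures = list(fpr, tpr))
plotROCCurves(roc_df)
```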

# which variables are most important?
dat <- data.frame(vars=names(splits$variable.importance), 
                  importance=splits$variable.importance)
# plot the feature importances
ggplot(dat, aes(reorder(vars, importance, sum), importance))+
    coord_flip()+
    geom_col()+
    theme(legend.position = "bottom", 
            legend.text=element_text(size=8),
        legend.title = element_text(size=10),
        axis.text.x = element_text(angle=90, vjust=0.5),
        text=element_text(size=14))+
    ggtitle("Optimized rpart Feature Importance")+
    xlab("Feature Splits")+
    ylab("Importance Value")

Reviewing the optimized rpart feature importance, we can see that along with “perCaps” and “bodyCharCt”, one of the emerging leaders is the number of forwards in an email. Reviewing the decision tree plot above, emails whose number of forwards exceeds 6 have historically been valid.

Even though our optimized decision tree is complex, we can still see some of the key features that define the classification of spam and valid emails (at a success rate of 96%). The number of forwards in the email was the biggest indicator of spam versus valid: our decision tree shows that the node for emails forwarded more than 7 times captures 66% of spam instances. The next most important features, bodyCharCt, numLines, and perCaps, were the strongest remaining indicators. For perCaps (the percentage of capital letters in an email message), our decision tree shows that a value above 15% gives an email a greater likelihood of being spam.

Conclusion

Through a detailed exploratory data analysis, rpart modeling, testing of various models, and hyperparameter tuning, we were able to leverage rpart's baseline settings to achieve a high AUC with low misclassification, false positive, and false negative rates. We were then able to tune our parameters to achieve nearly 97% accuracy in partitioning between spam and valid email messages. However, when we applied the hyperparameter tunes found in our Python modeling ensemble, we were not able to match the results of our optimized grid search. Our analysis showed that the complexity penalty applied to rpart plays a strong role in the optimized AUC score. In the end, our optimized partitioning took the form of a relatively complex (and difficult to read) rpart decision tree.

A dataset more balanced between spam and valid emails, and large enough to support the number of decision tree combinations required, would likely improve classification further. Looking deeper into decision thresholds and learning curves should also help optimize the classification criteria and the training data used to fit an optimal rpart model.
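Because the mlr learner predicts probabilities, the decision threshold can be moved without refitting; a sketch (pred is an assumed mlr prediction object, and the 0.7 cutoff is arbitrary):

```r
library(mlr)

# Raising the threshold on the positive "Spam" class trades false
# positives for false negatives
pred_strict <- setThreshold(pred, threshold = 0.7)
performance(pred_strict, measures = list(fpr, fnr))
```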

Deployment

The border of what we classify as “spam” is blurred by what individual users will accept, so a balance must be struck between being too lenient, letting commercial email through, and too aggressive, sending important emails to the junk folder. As spammers get more savvy at dodging detection algorithms while still sending messages en masse, our algorithm would need to be constantly updated with new data to ensure it detects and learns from the latest trends and tactics.

These findings have value not just for email service providers but for email writers as well. Knowing how email servers parse spam from valid emails can help marketers and users craft emails that get through the filters and reach their audiences. This means focusing on elements such as avoiding over-capitalization and writing a sufficiently long email body, which will look more valid to email filters.

References

Nolan, D., and D. Temple Lang. Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press, 2017.